Clustering in life sciences.

Authors

  • Ying Zhao
  • George Karypis
Abstract

Clustering is the task of organizing a set of objects into meaningful groups. These groups can be disjoint, overlapping, or organized in some hierarchical fashion. The key element of clustering is the notion that the discovered groups are meaningful. This definition is intentionally vague, because what constitutes meaningful is, to a large extent, application dependent. In some applications it may translate to groups in which the pairwise similarity between objects in the same group is maximized and the pairwise similarity between objects of different groups is minimized. In other applications it may translate to groups whose objects share some key characteristics, even though their overall similarity is not the highest. Clustering is an exploratory tool for analyzing large datasets and has been used extensively in numerous application areas. It has a wide range of applications in the life sciences, and over the years it has been used in areas ranging from the analysis of clinical information to phylogeny, genomics, and proteomics. For example, clustering algorithms applied to gene expression data can be used to identify co-regulated genes and provide a genetic fingerprint for various diseases. Clustering algorithms applied to the entire database of known proteins can be used to automatically organize the proteins into close- and distant-related families, and to identify subsequences that are mostly preserved across proteins [52, 22, 55, 68, 49]. Similarly, clustering algorithms applied to tertiary structural datasets can be used to perform a similar organization and provide insights into the rate of change between sequence and structure [20, 65].
The primary goal of this chapter is to provide an overview of the various issues involved in clustering large datasets, describe the merits and underlying assumptions of some of the commonly used clustering approaches, and provide insights on how to cluster datasets arising in various areas within the life sciences. Toward this end, the chapter is organized broadly into three parts. The first part (Sections 2–4) describes the various types of clustering algorithms developed over the years, the various methods for computing the similarity between objects arising in the life sciences, and methods for assessing the quality of the clusters. The second part (Section 5) focuses on the problem of clustering data arising from microarray experiments and describes some of the commonly used approaches. Finally, the third part (Section 6) provides a brief introduction to CLUTO, a general-purpose toolkit for clustering various datasets, with an emphasis on its applications to problems and analysis requirements within the life sciences.
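
One reading of "meaningful" given in the abstract is that objects in the same group should be more similar to each other than to objects in other groups. The sketch below is only an illustration of that idea, not a method taken from the chapter: it assumes cosine similarity as the measure, and the function names and toy profiles are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def within_between_similarity(points, labels):
    """Return (mean within-cluster, mean between-cluster) pairwise similarity."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        sim = cosine(points[i], points[j])
        # A pair counts as "within" when both objects carry the same label.
        (within if labels[i] == labels[j] else between).append(sim)

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(within), mean(between)


# Hypothetical toy profiles (assumed data): two groups with different directions.
profiles = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
good = [0, 0, 1, 1]  # groups the similar profiles together
bad = [0, 1, 0, 1]   # mixes the two groups

w_good, b_good = within_between_similarity(profiles, good)
w_bad, b_bad = within_between_similarity(profiles, bad)
print(f"good grouping: within={w_good:.3f}, between={b_good:.3f}")
print(f"bad grouping:  within={w_bad:.3f}, between={b_bad:.3f}")
```

Under this criterion the first grouping scores a higher within-cluster similarity and a lower between-cluster similarity than the second. As the abstract notes, other applications may define meaningful groups differently, so this is one possible criterion rather than a general definition.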

Similar resources

Comparing Model-based Versus K-means Clustering for the Planar Shapes

In some fields, there is an interest in distinguishing different geometrical objects from each other. A field of research that studies the objects from a statistical point of view, provided they are invariant under translation, rotation, and scaling effects, is known as statistical shape analysis. Having some objects that are registered using key points on the outline...

Data clustering in life sciences.

Clustering has a wide range of applications in life sciences and over the years has been used in many areas ranging from the analysis of clinical information, phylogeny, genomics, and proteomics. The primary goal of this article is to provide an overview of the various issues involved in clustering large biological datasets, describe the merits and underlying assumptions of some of the commonly...

Fuzzy Clustering Based Routing in Wireless Body Area Networks to Increase the Life of Sensor Nodes

Body area networks are a type of wireless network created to make better use of hospital resources, enable earlier diagnosis of medical symptoms, and ultimately reduce the cost of medical care. Like most wireless networks, this network has no infrastructure, and the sensor nodes embedded in the body have limited energy. Hence, the early power completion...

Retaining Customers Using Clustering and Association Rules in Insurance Industry: A Case Study

This study clusters customers of a life insurance company and identifies the characteristics of the different groups in order to predict customer behavior based on payment. The approach uses clustering and association rules, following the CRISP-DM methodology in data mining. The researcher could classify customers of each policy into three different clusters, using association rules. A...

A Hybrid Data Clustering Algorithm Using Modified Krill Herd Algorithm and K-MEANS

Data clustering is the process of partitioning a set of data objects into meaningful clusters or groups. Because clustering algorithms are used in so many fields, a lot of research is still going on to find the best and most efficient clustering algorithm. K-means is simple and easy to implement, but it is sensitive to the initialization of cluster centers and hence can become trapped in a local optimum. In this paper...

Journal:
  • Methods in molecular biology

Volume 224

Publication date: 2003